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Abstract 

Background: Fast, specific identification and surveillance of pathogens is the cornerstone of any outbreak 
response system, especially in the case of emerging infectious diseases and viral epidemics. This process is 
generally tedious and time-consuming thus making it ineffective in traditional settings. The added complexity in 
these situations is the non-availability of pure isolates of pathogens as they are present as mixed genomes or 
hologenomes. Next-generation sequencing approaches offer an attractive solution in this scenario as it provides 
adequate depth of sequencing at fast and affordable costs, apart from making it possible to decipher complex 
interactions between genomes at a scale that was not possible before. The widespread application of next- 
generation sequencing in this field has been limited by the non-availability of an efficient computational pipeline 
to systematically analyze data to delineate pathogen genomes from mixed population of genomes or 
hologenomes. 

Findings: We applied next-generation sequencing on a sample containing mixed population of genomes from an 
epidemic with appropriate processing and enrichment. The data was analyzed using an extensive computational 
pipeline involving mapping to reference genome sets and de-novo assembly. In depth analysis of the data 
generated revealed the presence of sequences corresponding to Japanese encephalitis virus. The genome of the 
virus was also independently de-novo assembled. The presence of the virus was in addition, verified using standard 
molecular biology techniques. 

Conclusions: Our approach can accurately identify causative pathogens from cell culture hologenome samples 
containing mixed population of genomes and in principle can be applied to patient hologenome samples without 
any background information. This methodology could be widely applied to identify and isolate pathogen genomes 
and understand their genomic variability during outbreaks. 

Keywords: Epidemics, Mixed Population Genomes, Hologenome, De novo assembly, Japanese encephalitis, Next 
generation sequencing 



Findings 

Viral pathogens have been a major cause of epidemics 
worldwide [1-3]. The regular surveillance system in 
many cases of viral epidemics is ineffective to directly 
identify the virus unless directed by telltale clinical fea- 
tures and clinical complications [4,5]. In many cases the 
disease causing pathogen is not identified [6], and con- 
tributes to the inadequate and inappropriate 
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management of the condition. Many of the disease out- 
breaks are also caused by changes in the environmental 
niche [7], and at least in some cases cause the emer- 
gence of new pathogens [6,8-11]. It is estimated that at 
least 33 new pathogens have emerged during the last 
three decades [12]. Identification of causative organisms 
during viral outbreak is a problem, primarily due to the 
tedious methods involving isolation and culturing of the 
pathogen [13,14], The relatively low concentration of 
the genetic material, especially in the case of RNA 
viruses and the heavy interference due to the host cell 
nucleic acid and other metagenomes makes sequencing 
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based approaches ineffective, unless done at higher 
depths [15,16]. 

The availability of next-generation sequencing (NGS) 
technology has enabled the scale and ease of addressing 
biological questions on a genomics perspective [17,18]. 
The throughput of sequencing enables deep sequencing 
of nucleic acids, adequate to provide for enough reads 
of the pathogen, even while the interference of the host 
genetic material is very high. Metagenomics has been 
one of the major applications of NGS technology for 
understanding the composition and dynamics of mixed 
population of organisms [19]. The field has now 
emerged to a vibrant area of genomics trying to under- 
stand a large spectrum of environmental niches right 
from human body in both disease as well as healthy 
states to natural geographical niches [20,21]. Although 
NGS technology has been successfully used for addres- 
sing a large diversity of biological questions, its applica- 
tion to address questions pertaining to bio-surveillance 
and emerging infectious diseases on a large scale has 
been limited [22], in spite of the unprecedented oppor- 
tunity provided by the scale and speed of operations 
[17]. 

Hologenome, a term borrowed from evolutionary biol- 
ogy is defined as the sum of the genetic information of 
the host and its microbiota [23]. Hologenomics is an 
emerging field in genomics which deals with mixed 
population of genomes, as in the case of interacting 
populations in host-pathogen and commensals. Holo- 
genome differs from the widely popular term Metagen- 
ome, which involves the study of communities of 
microbes directly in their natural environments [24]. 

Cultured population of viruses co-exists and interacts 
intricately with their host genomes [23]. Although this 
presents a technical challenge in isolating individual 
genomes from mixed populations, it offers enormous 
possibility to understand the interactions and dynamics 
between the genomes in real-time. 

Here we report the sequencing and analysis methodol- 
ogy, involving computational algorithms for reference 
mapping and de novo sequence assembly to accurately 
identify viral pathogens from mixed populations of gen- 
omes. The pipeline relies on the specificity of sequence 
mappings and the differential distribution of the mapped 
reads across genomes. As a proof of concept, we applied 
the methodology on a cell culture hologenome consisting 
of human, bacterial and viral genomes, and could specifi- 
cally identify the viral pathogen. This methodology could 
potentially be applied for rapid and specific identification 
of viral pathogens during epidemic outbreaks. 

Sample collection and RNA isolation 

The sample was collected and isolated during an epi- 
demic of acute viral encephalitis from an anonymous 



patient suffering from fever and acute encephalitis from 
Baba Raghav Das (BRD) Medical College and Nehru 
Hospital, Gorakhpur, India. Sample was procured and 
processed as per ethical procedures laid down by BRD 
Medical College, Gorakhpur, India and National Insti- 
tute of Virology, Pune, India. Using standard virus isola- 
tion protocols samples were inoculated in human 
Rhabdosarcoma (RD) and Baby Hamster kidney (BHK) 
cell lines for virus isolation [25,26]. Cells were observed 
for cytopathological effects (CPE) and passaged three 
times. The cell culture supernatant was filtered using 
0.22 |im Millipore filters for every passage. RNA was 
isolated using Qiagen (QiaAMP viral RNA minikit) kit 
as per manufacturer's instructions. The RNA was eluted 
in 60 (ilitre of AVE buffer. 

Library preparation, sequencing and genome assembly 

The RNA library was prepared according to the manu- 
facturer's instructions using RNA Sample-prep kit (Illu- 
mina Inc, USA) for sequencing on Illumina sequencing 
platform. Two microgram of total RNA was fragmented 
using divalent cation. Cleaved RNA was converted to 
cDNA using reverse transcriptase (SSRT-II Invitrogen) 
and random primers. The fragments were further sub- 
jected to second strand cDNA synthesis using DNA 
polymerase as per manufacturer's instructions. End- 
repairing process followed by A-base addition and adap- 
ter ligation was further performed on the cDNA frag- 
ments. Approximately 350 base pair products were 
separated by gel excision and enriched with PCR to cre- 
ate the final library. 

Clusters were generated on the flow cell using cBot 
Paired end cluster generation kit (Illumina Inc, USA) as 
per manufacturer's instructions. The sequencing runs 
were performed on Illumina Inc, USA) using 76 x 2 
base reads. The sequence-quality files generated was 
transformed to Sanger quality scores using custom 
scripts. 

The paired end reads were mapped to the reference 
datasets using Mapping and Assembly with Qualities 
(MAQ) software [27]. The datasets of 3735 viral gen- 
omes, 2352 Bacterial genomes and the human genome 
corresponding to GRch37/hgl9 build was downloaded 
from NCBI [28] and used for mapping. The mapped 
reads were further analyzed and compared for reads 
that overlapped in each reference set. The genomes that 
mapped the maximum number of reads post-alignment 
were parsed using custom scripts and were further con- 
sidered for analysis. Single nucleotide variations and 
Insertion Deletion (InDel) events were called using 
MAQ scripts. Mappings and functional analysis of the 
variations were performed using custom scripts. The 
entire pipeline for the data generation and analysis is 
summarized in Figure 1. 
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Figure 1 Workflow. Summary of the Methodology for viral sample 
processing and computational analysis. 



Velvet [29], a popularly used de novo assembly algo- 
rithm based on de Bruijn graphs was used for the de 
novo assembly. The entire read data was partitioned into 
smaller subsets for analysis. De-novo assembly was 
attempted on the subsets with different k-mers. The 
data was compiled and compared using in-house scripts. 

RT-PCR validation 

Experimental validation of the virus was performed 
using reverse transcription (RT) -polymerase chain reac- 
tion. Specific primer JEV PM1R (reverse primer): 5'- 
CGGARTCTCCTGCTTCGCTTGG-3' and JEV C1F 
(forward primer): 5'-GGCAGAAAGCAAAA- 
CAAAAGA-3' specific for Japanese encephalitis virus 
were used. RNA was initially reverse transcribed using 
JEV PM1R at 50°C using Superscript II reverse tran- 
scriptase (Invitrogen, Life Science Technologies). The 
cDNA was amplified using forward primer JEV C1F and 
reverse primer JEV PM1R. PCR amplification was car- 
ried out by denaturing 94°C for 5 min, followed by 35 
cycles of 94°C for 30 sec 58°C for 30 sec, 72°C for 1 min 
and final extension of 3 min using Taq DNA polymer- 
ase (Fermentas). 

Results and discussion 

We generated approximately 22 million sequence reads 
from the cell culture hologenome sample [SRA: 



SRX099040]. All reads were 76 bases in paired end 
mode, and amounted to a total of 1.67 Gigabases. The 
paired-end reads had an average insert size of 350 bases. 
The read statistics is available as Additional file 1. The 
reads were mapped to Human, 2352 bacteria and 3735 
viral genome datasets using Mapping and Assembly 
with Qualities (MAQ) with default parameters. Maxi- 
mum of two bases mismatching was permitted in the 
seed, and the alignments were performed in paired end 
mode. All the three alignment files were later trans- 
formed and the read headers, which overlap in each of 
the sets, were compared using in-house scripts. We 
found 2,835,442 reads mapping to the Human genome, 
comprising of 13% of the entire reads, 774,056 reads 
comprising of 3.6% of the data mapping to the bacterial 
reference set, while 10,726,844 mapped to Virus gen- 
omes, comprising 49% of the reads (Figure 2a &2b). We 
also compared the reads that overlapped in each of the 
reference alignment sets. The overlaps were minimal 
suggesting specificity of sequence alignments. The reads 
mapping to the viral sequence dataset comprising of 
3735 complete viral reference genomes were further 
analyzed in depth. The analysis revealed that 99% of the 
reads mapped to the Japanese encephalitis Virus gen- 
ome (Figure 2c). This comprised of over 10 million 
reads and provided an effective coverage of over 
70,000X over the Japanese encephalitis genome. 




Japanese Encephalitis Genome 



// 



3735 Virus Genomes 



Figure 2 Reads mapping and analysis. Mapping and analysis of 
reads mapping to reference sequence datasets. (A) The distribution 
of reads mapping to each reference dataset. (B) Reads mapping to 
each of the three reference datasets and the overlaps between 
them. (C) Distribution of reads mapping to the viral reference 
datasets. The horizontal axis depicts the 3735 viral genomes while 
the vertical axis depicts the reads mapping to each of the viral 
genomes. The reads that map to Japanese encephalitis virus was 
ordered different from the reads that mapped to the next highest. 
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Reference mapping is an attractive strategy, as it is fast 
and frugal on compute and memory requirements, how- 
ever is limited by the availability of the genome under 
consideration in the reference dataset. This could limit 
the widespread application of this methodology in iden- 
tifying pathogens that have not been sequenced before, 
as in the case of new and emerging infectious diseases. 
This limitation could be potentially overcome using a de 
novo assembly strategy, which does not rely on prior 
knowledge of the genome sequence. We therefore 
attempted de novo assembly of the cell culture hologen- 
ome using Velvet. Different k-mer values and coverage 
cut-offs were used for the assembly. The largest contig 
was assembled at a k-mer of 27 and had a length of 
10,758 bases [GenBank: JN644310]. The de novo 
assembled contig aligned with 99% identity to the Japa- 
nese encephalitis virus isolate 014178 genome in NCBI 
nr database, having a length of 10,976 bases, further 
confirming the identity of the virus. We also attempted 
to validate the identity of virus using specific reverse 
transcriptase PCR (RT-PCR). JE specific primers map- 
ping in the C-prM region amplified approximately 400 
bp amplicon as expected [25,26], substantiating the 
identity of the virus. The data is represented in Addi- 
tional file 2. 

Furthermore the reads and alignments in the refer- 
ence assembly that mapped to Japanese encephalitis 
genome were analyzed in depth to understand the gen- 
ome organization and genomic variations of the isolate. 
We identified a total of 209 genomic variations in the 
genome with high confidence. The complete list of 
variations, their genomic loci, their type (i.e. synon- 
ymous and non-synonymous) and the genes that har- 
bor them are summarized in Additional file 3. Fast and 
specific identification of viral pathogens would enable 
adequate and timely response to disease outbreaks. 
Next generation sequencing technology coupled with 
efficient computational algorithms could be effectively 
used for the identification of pathogens in such 
epidemics. 

Conclusions 

Here we have used next-generation sequencing 
approach and generated over 22 million sequence reads 
from a cell culture hologenome sample consisting of 
human, bacteria and virus. We also successfully applied 
a pipeline of computational algorithms including refer- 
ence mapping and de novo assembly to identify the 
pathogen as Japanese encephalitis virus. Furthermore we 
also validated the identity of the virus using RT-PCR 
techniques. In summary we successfully demonstrate the 
utility of such approaches in identification of viral 
pathogens from mixed population of genomes. As 
cheaper and faster sequencing technology becomes 



widely available in research labs, this approach would 
open up immense opportunities in identifying viral 
pathogens in outbreaks, apart from being used in regu- 
lar bio-surveillance of infectious diseases and screening 
for agents used in bioterrorism. This methodology could 
also find application in special situations like food pro- 
cessing, beverage, and water quality monitoring for iden- 
tification of food borne and waterborne pathogens 
respectively. 

Availability of supporting data 

Raw sequence data is available at the NCBI Short Read 
Archive with ID: SRX099040. Sequence generated using 
de novo genome assembly tools is available at Genbank 
with ID: JN644310. 

Additional material 



Additional file 1: Read Statistics. Excel file containing detailed 
information on various read statistics. 

Additional file 2: RT-PCR validations. JPG image depicting reverse 
transcriptase validation for Japanese encephalitis. 

Additional file 3: List of Variations. Microsoft DOC file containing table 
on variations identified in Japanese Encephalitis genome (NC_001437). 
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